burst prediction
I/O Burst Prediction for HPC Clusters using Darshan Logs
Saeedizade, Ehsan | Taheri, Roya | Arslan, Engin
Understanding cluster-wide I/O patterns of large-scale HPC clusters is essential to minimize the occurrence and impact of I/O interference. Yet, most previous work in this area focused on monitoring and predicting task- and node-level I/O burst events. This paper analyzes Darshan reports from three supercomputers to extract system-level read and write I/O rates in five-minute intervals. We observe significant (over 100x) fluctuations in read and write I/O rates in all three clusters. We then train machine learning models to estimate the occurrence of system-level I/O bursts 5 to 120 minutes ahead. Evaluation results show that we can predict I/O bursts with more than 90% accuracy (F1 score) five minutes ahead and more than 87% accuracy two hours ahead. We also show that the ML models attain more than 70% accuracy when estimating the degree of the I/O burst. We believe that high-accuracy predictions of I/O bursts can be used in multiple ways, such as postponing delay-tolerant I/O operations (e.g., checkpointing), pausing nonessential applications (e.g., file system scrubbers), and devising I/O-aware job scheduling methods. To validate this claim, we simulated a burst-aware job scheduler that can postpone the start time of applications to avoid I/O bursts. We show that burst-aware job scheduling can reduce application runtime by up to 5x.
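The interval extraction described in the abstract can be illustrated with a minimal sketch. It assumes the log data has already been reduced to hypothetical `(timestamp_seconds, bytes)` records, which is a simplification of what Darshan reports contain, and it labels a bin as a burst when its rate exceeds a multiple of the median rate. That threshold rule and the names `io_rates` and `label_bursts` are assumptions made for illustration, not the authors' definitions.

```python
from collections import defaultdict
from statistics import median

BIN_SECONDS = 300  # five-minute intervals, as stated in the abstract


def io_rates(records, bin_seconds=BIN_SECONDS):
    """Sum bytes per time bin and return {bin_start: bytes_per_second}.

    Each record is a hypothetical (timestamp_seconds, bytes) pair that is
    attributed entirely to the bin containing its timestamp.
    """
    totals = defaultdict(int)
    for timestamp, nbytes in records:
        bin_start = int(timestamp // bin_seconds) * bin_seconds
        totals[bin_start] += nbytes
    return {start: total / bin_seconds for start, total in sorted(totals.items())}


def label_bursts(rates, factor=10.0):
    """Flag bins whose rate exceeds factor x the median rate.

    The median-multiple rule is an assumed burst definition, used here
    only to make the example concrete.
    """
    baseline = median(rates.values())
    return {start: rate > factor * baseline for start, rate in rates.items()}


# Toy records: steady traffic with one large transfer in the third bin.
records = [(0, 3000), (120, 3000), (310, 6000), (650, 900000), (905, 6000)]
rates = io_rates(records)
bursts = label_bursts(rates)
print(rates)
print(bursts)
```

The resulting per-bin labels are the kind of target a classifier could be trained to predict some number of bins ahead; the features and models used for that step are not specified in the abstract.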
Cracks Under Pressure? Burst Prediction in Water Networks Using Dynamic Metrics
Kaushik, Gollakota (Tata Consultancy Services) | Manimaran, Abinaya (Tata Consultancy Services) | Vasan, Arunchandar (Tata Consultancy Services) | Sarangan, Venkatesh (Tata Consultancy Services) | Sivasubramaniam, Anand (Penn State University)
Ranking pipes according to their burst likelihood can help a water utility triage its proactive maintenance budget effectively. In the research literature, data-driven approaches have been used recently to predict pipe bursts. Such approaches make use of static features of the individual pipes such as diameter, length, and material to estimate burst likelihood for the next year by learning over historical data. The burst likelihood of a pipe also depends on dynamic features such as its pressure and flow. Existing works ignore dynamic features because these features need to be measured or are difficult to obtain accurately using a well-calibrated hydraulic model. We complement prior data-driven approaches by proposing a methodology to approximately estimate the dynamic features of individual pipes from readily available network structure and other data. We study the error introduced by our approximation on an academic benchmark water network with ground truth. Using a real-world pipe burst dataset obtained from a European water utility for multiple years, we show that our approximate dynamic features improve the ability of machine learning classifiers to predict pipe bursts. The performance (as measured by the percentage of future bursts predicted) of the best-performing classifier improves by nearly 50% through these dynamic features.
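One plausible reading of the evaluation metric mentioned above (the percentage of future bursts predicted) is the share of next-year bursts that fall among the top-ranked pipes a utility can afford to inspect. The sketch below illustrates that reading only. The function `bursts_captured`, the `budget` parameter, the pipe identifiers, and the scores are all hypothetical, and the paper's exact metric definition may differ.

```python
def bursts_captured(scores, burst_ids, budget):
    """Percentage of actual bursts found among the `budget` highest-scored pipes.

    scores:    {pipe_id: predicted burst likelihood} (hypothetical values)
    burst_ids: set of pipe ids that actually burst in the following period
    budget:    number of pipes the utility can inspect (assumed parameter)
    """
    if not burst_ids:
        return 0.0
    # Rank pipes from most to least likely to burst.
    ranked = sorted(scores, key=scores.get, reverse=True)
    inspected = set(ranked[:budget])
    return 100.0 * len(inspected & burst_ids) / len(burst_ids)


# Toy example with made-up likelihoods for five pipes.
scores = {"p1": 0.9, "p2": 0.1, "p3": 0.7, "p4": 0.4, "p5": 0.2}
actual_bursts = {"p1", "p4"}

top2 = bursts_captured(scores, actual_bursts, budget=2)
top3 = bursts_captured(scores, actual_bursts, budget=3)
print(top2, top3)
```

Under this reading, adding dynamic features such as estimated pressure and flow would show up as a higher captured percentage at the same inspection budget.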